
180 research outputs found

Applying A Normalized Compression Metric To The Measurement Of Dialect Distance

Author: Osenova Petya
Simov Kiril
Publication venue: Institute of Mathematics and Informatics Bulgarian Academy of Sciences
Publication date: 01/01/2007

The paper discusses the application of a similarity metric based on compression to the measurement of the distance among Bulgarian dialects. The similarity metric is defined on the basis of the notion of Kolmogorov complexity of a file (or binary string). The application of Kolmogorov complexity in practice is not possible because its calculation over a file is an undecidable problem. Thus, the actual similarity metric is based on a real-life compressor which only approximates the Kolmogorov complexity. To use the metric for distance measurement of Bulgarian dialects we first represent the dialectological data in such a way that the metric is applicable. We propose two such representations which are compared to a baseline distance between dialects. Then we conclude the paper with an outline of our future work.
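As a rough illustration of the idea in this abstract, the sketch below computes a normalized compression distance with a real-life compressor standing in for Kolmogorov complexity. The record does not say which compressor or data representation the authors used, so the choice of `zlib`, the helper names `compressed_size` and `ncd`, and the sample strings are all assumptions made for illustration only.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (a stand-in for Kolmogorov complexity)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Invented placeholder strings, not real dialect data.
sample = "mljako hljab voda ".encode("utf-8") * 50
unrelated = bytes(range(256)) * 4

print(ncd(sample, sample))     # small: a string is close to itself
print(ncd(sample, unrelated))  # larger: the two strings share little structure
```

Values near 0 suggest that one input adds little information to the other, while values near 1 suggest the inputs are largely unrelated. Results depend on the compressor and on how the data are encoded.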

Bulgarian Digital Mathematics Library at IMI-BAS

Using the linguistic knowledge in BulTreeBank for the selection of the correct parses

Author: Osenova Petya
Simov Kiril
Publication date: 01/12/2010

Proceedings of the Ninth International Workshop on Treebanks and Linguistic Theories. Editors: Markus Dickinson, Kaili Müürisep and Marco Passarotti. NEALT Proceedings Series, Vol. 9 (2010), 163-174. © 2010 The editors and contributors. Published by Northern European Association for Language Technology (NEALT) http://omilia.uio.no/nealt . Electronically published at Tartu University Library (Estonia) http://hdl.handle.net/10062/15891

DSpace at Tartu University Library

The Role of Language Technologies in Digital Humanities (The Case of Parliamentary Debates)

Author: Petya Osenova
Publication venue: Bulgarian Academy of Sciences, Institute of Mathematics and Informatics
Publication date: 01/09/2023

The paper focuses on the use case of parliamentary debates as part of Digital Humanities. First, the ParlaMint project is outlined as a flagship initiative of the CLARIN ERIC infrastructure. The project makes content from the national and regional parliaments visible, comparable and accessible for policy making and research. Then, the approaches applied in the creation of 31 corpora from national and regional parliaments are considered. Finally, the utility of the multilingual resource is discussed.

Directory of Open Access Journals

The data-driven Bulgarian WordNet: BTBWN

Author: Osenova Petya
Simov Kiril
Publication venue: 'Institute of Slavic Studies Polish Academy of Sciences'
Publication date: 01/01/2018

The paper presents our work towards the simultaneous creation of a data-driven WordNet for Bulgarian and a manually annotated treebank with semantic information. Such an approach requires synchronization of the word senses in both the syntactic and the lexical resources, without limiting the WordNet senses to the corpus or vice versa. Our strategy focuses on the identification of senses used in BulTreeBank, but the missing senses of a lemma have also been covered through exploration of bigger corpora. The identified senses have been organized in synsets for the Bulgarian WordNet. Then they have been aligned to the Princeton WordNet synsets. Various types of mappings between the two resources are considered in a cross-lingual aspect and with respect to ensuring maximum connectivity and potential for incorporating the language-specific concepts. The mapping between the two WordNets (English and Bulgarian) is a basis for applications such as machine translation and multilingual information retrieval.

Crossref

Biblioteka Nauki - repozytorium artykułów

Directory of Open Access Journals

Overview of the CLEF 2006 Multilingual Question Answering Track

Author: Ayache Christelle
Forner Pamela
Giampiccolo Danilo
Jijkoun Valentin
Magnini Bernardo
Osenova Petya
Peñas Anselmo
Rocha Paulo
Sacaleanu Bogdan
Sutcliffe Richard
Publication date: 25/10/2006

Repositório Comum

bgGLUE: A Bulgarian General Language Understanding Evaluation Benchmark

Author: Angelova Galia
Atanasova Pepa
Hardalov Momchil
Koychev Ivan
Mihaylov Todor
Nakov Preslav
Osenova Petya
Radev Dragomir
Simov Kiril
Stoyanov Ves
Publication date: 04/06/2023

We present bgGLUE (Bulgarian General Language Understanding Evaluation), a benchmark for evaluating language models on Natural Language Understanding (NLU) tasks in Bulgarian. Our benchmark includes NLU tasks targeting a variety of NLP problems (e.g., natural language inference, fact-checking, named entity recognition, sentiment analysis, question answering) and machine learning tasks (sequence labeling, document-level classification, and regression). We run the first systematic evaluation of pre-trained language models for Bulgarian, comparing and contrasting results across the nine tasks in the benchmark. The evaluation results show strong performance on sequence labeling tasks, but there is a lot of room for improvement for tasks that require more complex reasoning. We make bgGLUE publicly available together with the fine-tuning and the evaluation code, as well as a public leaderboard at https://bgglue.github.io/, and we hope that it will enable further advancements in developing NLU models for Bulgarian. Comment: Accepted to ACL 2023 (Main Conference).

arXiv.org e-Print Archive

The Multilingual Question Answering Track at CLEF

Author: Aunimo Lili
Ayache Christelle
Giampiccolo Danilo
Magnini Bernardo
Osenova Petya
Peñas Anselmo
Rijke Maarten de
Sacaleanu Bogdan
Santos Diana
Sutcliffe Richard
Publication date: 06/11/2008

Repositório Comum

Overview of the CLEF 2005 Multilingual Question Answering Track

Author: Aunimo Lili
Ayache Christelle
Giampiccolo Danilo
Magnini Bernardo
Osenova Petya
Peñas Anselmo
Rijke Maarten de
Sacaleanu Bogdan
Santos Diana
Sutcliffe Richard
Vallin Alessandro
Publication venue: Centromedia
Publication date: 13/10/2009

Repositório Comum